CSDA Lab, Mathematics and Statistics Department, University of West Florida
High-dimensional data refers to data with a large number of features (covariates) \(p\). Formally, for data \(\mathbf{X} \in \mathbb{R}^{n\times p}\):
\[ p \gg n, \tag{1} \] where \(n\) is the number of observations.
In this context, many challenges arise:
Some solutions in the literature:
Our data are mass spectrum signals (functional data).
The Fourier Transform of a signal \(x(t)\) can be expressed as:
\[ X(f)= \int_{-\infty}^{\infty} x(t) e^{-i2 \pi ft} dt, \tag{2} \] where \(f\) is the frequency and \(e^{ix}= \cos x + i \sin x\) (Euler’s formula).
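As a discrete illustration of equation (2), the sketch below uses NumPy's FFT to recover the dominant frequency of a sampled sinusoid. The sampling rate, the signal length, and the 50 Hz test tone are assumptions chosen for the example; they are not values from the mass spectrum data.

```python
import numpy as np

fs = 1000.0                        # assumed sampling rate (Hz)
n = 1000                           # assumed number of samples (one second)
t = np.arange(n) / fs
x = np.sin(2 * np.pi * 50.0 * t)   # hypothetical 50 Hz test signal

# Discrete analogue of X(f), restricted to non-negative frequencies.
X = np.fft.rfft(x)
freqs = np.fft.rfftfreq(n, d=1.0 / fs)

# The frequency bin with the largest magnitude identifies the tone.
dominant = freqs[np.argmax(np.abs(X))]
print(dominant)
```

The magnitude spectrum peaks at the frequency of the tone, but it carries no information about where in time that frequency occurs.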
The Wavelet Transform of a signal \(x(t)\) can be given as:
\[ WT(s,\tau)= \frac{1}{\sqrt s}\int_{-\infty}^{\infty} x(t) \psi^*\big(\frac{t-\tau}{s}\big) dt, \tag{3} \]
where \(\psi^*(t)\) denotes the complex conjugate of the base wavelet \(\psi(t)\), \(s\) is the scaling parameter, and \(\tau\) is the location parameter.
Example: Morlet Wavelet \(\psi(t) = e^{i2 \pi f_0t} e^{-(\alpha t^2/\beta^2)}\), with the parameters \(f_0\), \(\alpha\), \(\beta\) all being constants.
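Equation (3) with the Morlet wavelet can be approximated numerically by a Riemann sum. The sketch below is a minimal illustration, assuming \(f_0 = \alpha = \beta = 1\), a hypothetical 4 Hz cosine as the signal, and a uniform time grid; the function names `morlet` and `wavelet_transform` are introduced only for this example.

```python
import numpy as np

def morlet(t, f0=1.0, alpha=1.0, beta=1.0):
    """Morlet wavelet psi(t) = exp(i 2 pi f0 t) * exp(-alpha t^2 / beta^2)."""
    return np.exp(1j * 2 * np.pi * f0 * t) * np.exp(-alpha * t**2 / beta**2)

def wavelet_transform(x, t, s, tau):
    """Riemann-sum approximation of WT(s, tau) on a uniform grid t."""
    dt = t[1] - t[0]
    psi_conj = np.conj(morlet((t - tau) / s))
    return np.sum(x * psi_conj) * dt / np.sqrt(s)

fs = 200.0                           # assumed sampling rate (Hz)
t = np.arange(2000) / fs - 5.0       # grid on [-5, 5)
x = np.cos(2 * np.pi * 4.0 * t)      # hypothetical 4 Hz signal

# With f0 = 1, scale s corresponds to frequency f0 / s.
matched = wavelet_transform(x, t, s=0.25, tau=0.0)   # 4 Hz, matches the signal
mismatched = wavelet_transform(x, t, s=1.0, tau=0.0) # 1 Hz, does not match
print(abs(matched), abs(mismatched))
```

The coefficient is large when the scaled wavelet oscillates at the signal's frequency and close to zero otherwise, while \(\tau\) selects where along the signal the comparison is made.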
Ovarian cancer detection (Yu et al. 2005): A combination of Binning, Kolmogorov-Smirnov test, discrete wavelet transform, and support vector machines.
Proteomic profile with bi-orthogonal discrete wavelet transform (Schleif et al. 2009): A combination of centroid-based outlier detection, recalibration, baseline correction (top-hat filter), Kolmogorov-Smirnov test, discrete wavelet transform bior 3.7, and support vector machines.
Ovarian cancer detection using peaks and discrete wavelet transform (Du et al. 2009): A combination of discrete wavelet transform, thresholding, peak detection using MAD, Kolmogorov-Smirnov test, and a bagging predictor.
Ovarian cancer classification using wavelets and genetic algorithm (Nguyen et al. 2015): A combination of Haar discrete wavelet transform and genetic algorithms.
Breast cancer mass spectrum classification (Cohen, Messaoudi, and Badir 2018): A combination of segmentation, discrete wavelet transform, statistical features on the coefficients, PCA-T2 Hotelling statistic, and SVM.
Ovarian cancer mass spectrum classification (Vimalajeewa, Bruce, and Vidakovic 2023): A combination of Daubechies-7 wavelet transform, sample variance and distance variance, Fisher’s criterion for feature extraction, SVM, KNN, and Logistic regression.
A workflow for ML is the following:
1. Data Collection
2. Data Processing: Clean, Explore, Prepare, Transform
3. Modeling: Develop, Train, Validate, and Evaluate
4. Deployment: Deploy, Monitor, and Update
5. Go to 1.
We designed a statistical experiment to evaluate four different processing approaches.
Variables of the experimental design:
Four pre-processing techniques: two are wavelet-based and two are not.
Five window sizes.
Ten wavelet families.
Four ML models: Logistic Regression, Support Vector Machine, Random Forest, and XGBoost.
Two sampling schemes: upsampling (to address the class imbalance) and no sampling.
Each case is repeated 100 times.
A total of 88,000 models were run.
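The total can be checked with a short calculation. The sketch assumes that the wavelet family is a factor only for the two wavelet-based techniques and that all other factors are fully crossed; this crossing is an assumption, but it reproduces the stated total.

```python
# Factor levels taken from the list of design variables.
n_windows = 5
n_wavelets = 10
n_models = 4
n_sampling = 2
n_repeats = 100

# Assumption: the two wavelet-based techniques also vary the wavelet family.
wavelet_runs = 2 * n_windows * n_wavelets * n_models * n_sampling * n_repeats
# Assumption: the two techniques without a wavelet transform have no such factor.
plain_runs = 2 * n_windows * n_models * n_sampling * n_repeats

total = wavelet_runs + plain_runs
print(wavelet_runs, plain_runs, total)  # 80000 8000 88000
```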
Processing 1 (PROC1): Wavelet transform; the feature space includes the mean, variance, energy, coefficient of variation, skewness, and kurtosis.
Processing 2 (PROC2): Same as PROC1, but the feature space also includes the first 10 autocorrelation coefficients.
Processing 3 (PROC3): Same as PROC1 but without the wavelet transform.
Processing 4 (PROC4): Same as PROC2 but without the wavelet transform.
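A minimal sketch of the feature spaces without the wavelet step (PROC3-like and PROC4-like) is given below. Several details are assumptions made for illustration: features are computed on non-overlapping windows, energy is taken as the sum of squares, kurtosis is the non-excess version, and the input is a synthetic stand-in rather than a real spectrum. The function names are hypothetical. For the wavelet-based variants, the same summaries would presumably be computed on wavelet coefficients instead of the raw signal.

```python
import numpy as np

def summary_features(segment):
    """Mean, variance, energy, coefficient of variation, skewness, kurtosis."""
    x = np.asarray(segment, dtype=float)
    mean = x.mean()
    var = x.var(ddof=1)                  # sample variance
    z = (x - mean) / x.std()             # standardized values (population SD)
    return [
        mean,
        var,
        np.sum(x**2),                    # energy (assumed: sum of squares)
        np.sqrt(var) / mean,             # coefficient of variation
        np.mean(z**3),                   # skewness
        np.mean(z**4),                   # kurtosis (non-excess)
    ]

def autocorrelations(segment, n_lags=10):
    """Sample autocorrelation coefficients at lags 1..n_lags."""
    x = np.asarray(segment, dtype=float)
    xc = x - x.mean()
    denom = np.sum(xc**2)
    return [np.sum(xc[k:] * xc[:-k]) / denom for k in range(1, n_lags + 1)]

def window_features(signal, window_size, with_acf=False):
    """One feature row per non-overlapping window of the signal."""
    rows = []
    for start in range(0, len(signal) - window_size + 1, window_size):
        seg = signal[start:start + window_size]
        row = summary_features(seg)
        if with_acf:
            row = row + autocorrelations(seg)
        rows.append(row)
    return np.array(rows)

rng = np.random.default_rng(0)
spectrum = rng.gamma(shape=2.0, scale=1.0, size=1024)  # synthetic stand-in

proc3_like = window_features(spectrum, window_size=256)
proc4_like = window_features(spectrum, window_size=256, with_acf=True)
print(proc3_like.shape, proc4_like.shape)  # (4, 6) (4, 16)
```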
The performance metrics utilized include precision, recall, F1 score, and accuracy.
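The results table reports precision, recall, F1 score, and accuracy. As a reminder of how these follow from confusion-matrix counts, here is a minimal sketch; the counts are hypothetical and the function name is introduced only for this example.

```python
def classification_metrics(tp, fp, fn, tn):
    """Standard binary classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)           # share of predicted positives that are correct
    recall = tp / (tp + fn)              # share of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Hypothetical counts, not taken from the experiment.
metrics = classification_metrics(tp=8, fp=2, fn=2, tn=8)
print(metrics)
```

For these counts all four metrics equal 0.8, up to floating-point rounding.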
Observed 32,768 m/z values / 33,885 m/z values
Link: https://bioinformatics.mdanderson.org/public-datasets/
Performance across Processing
| Characteristic | PROC1 Nosampling | PROC1 Upsampling | PROC2 Nosampling | PROC2 Upsampling | PROC3 Nosampling | PROC3 Upsampling | PROC4 Nosampling | PROC4 Upsampling |
|---|---|---|---|---|---|---|---|---|
| precision | 0.70 ± 0.10 (0.40,1.00) | 0.70 ± 0.10 (0.40,1.00) | 0.71 ± 0.12 (0.40,1.00) | 0.71 ± 0.12 (0.40,1.00) | 0.78 ± 0.12 (0.57,1.00) | 0.78 ± 0.12 (0.50,1.00) | 0.73 ± 0.10 (0.50,1.00) | 0.73 ± 0.10 (0.50,1.00) |
| recall | 0.88 ± 0.16 (0.40,1.00) | 0.88 ± 0.16 (0.40,1.00) | 0.86 ± 0.17 (0.40,1.00) | 0.86 ± 0.17 (0.40,1.00) | 0.91 ± 0.12 (0.60,1.00) | 0.91 ± 0.12 (0.60,1.00) | 0.88 ± 0.17 (0.40,1.00) | 0.88 ± 0.17 (0.40,1.00) |
| F1.score | 0.77 ± 0.10 (0.40,1.00) | 0.77 ± 0.10 (0.40,1.00) | 0.77 ± 0.10 (0.40,1.00) | 0.77 ± 0.11 (0.40,1.00) | 0.83 ± 0.10 (0.60,1.00) | 0.83 ± 0.10 (0.55,1.00) | 0.79 ± 0.10 (0.44,0.91) | 0.79 ± 0.10 (0.44,0.91) |
| accuracy | 0.68 ± 0.13 (0.25,1.00) | 0.68 ± 0.13 (0.25,1.00) | 0.68 ± 0.13 (0.25,1.00) | 0.68 ± 0.13 (0.25,1.00) | 0.77 ± 0.13 (0.50,1.00) | 0.77 ± 0.13 (0.38,1.00) | 0.71 ± 0.12 (0.38,0.88) | 0.71 ± 0.12 (0.38,0.88) |

Values are Mean ± SD (Min,Max).
All PROCs seem to perform similarly!
Performance across window sizes
Joint Mathematics Meetings | Jan 8-11, 2025 | Seattle